Kazutoshi KOBAYASHI Kazuhiko TERADA Hidetoshi ONODERA Keikichi TAMARU
We propose a real-time low-rate video compression algorithm using fixed-rate multi-stage hierarchical vector quantization. Vector quantization is suitable for mobile computing, since it requires little computation for decoding. The proposed algorithm enables transmission of 10 QCIF frames per second over a low-rate 29.2 kbps mobile channel. A frame is hierarchically divided into sub-blocks, and every frame is compressed at a fixed rate regardless of video activity. For active frames, mainly large sub-blocks giving low resolution are transmitted. For inactive frames, smaller sub-blocks giving higher resolution can be transmitted successively after a motion-compensated frame. We have developed a compression system consisting of a host computer and a memory-based processor for the nearest neighbor search in VQ. Our algorithm guarantees real-time decoding even on a low-performance CPU.
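As a purely illustrative sketch of the fixed-rate idea described above (with hypothetical block sizes and bit costs; the authors' actual coder is not reproduced here), sub-blocks can be transmitted largest first until the per-frame bit budget is exhausted:

# Minimal sketch of fixed-rate hierarchical sub-block selection (hypothetical
# parameters; the actual coder described in the paper differs in detail).

def select_blocks(blocks, bits_per_block, frame_budget_bits):
    """Pick sub-blocks, largest first, until the fixed per-frame budget is used.

    blocks: list of (size, block_id) tuples, e.g. size 16 for a 16x16 block.
    Returns the ids of the blocks that fit into this frame.
    """
    chosen = []
    spent = 0
    # Large blocks (coarse resolution) come first; smaller refinement blocks
    # follow on inactive frames when budget remains.
    for size, block_id in sorted(blocks, key=lambda b: -b[0]):
        cost = bits_per_block[size]
        if spent + cost > frame_budget_bits:
            continue
        chosen.append(block_id)
        spent += cost
    return chosen

if __name__ == "__main__":
    blocks = [(16, "B0"), (8, "B1"), (8, "B2"), (4, "B3")]
    print(select_blocks(blocks, {16: 600, 8: 300, 4: 150}, 1200))  # ['B0', 'B1', 'B2']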
Takashi YUKAWA Sanda M. HARABAGIU Dan I. MOLDOVAN
This paper presents an algorithm for viewpoint-based similarity discernment of linguistic concepts on the Semantic Network Array Processor (SNAP). Viewpoint-based similarity discernment plays a key role in retrieving similar propositions, which is useful for advanced knowledge-processing areas such as analogical reasoning and case-based reasoning. The algorithm assumes that a knowledge base is constructed for SNAP, based on information acquired from the WordNet linguistic database. The algorithm identifies paths in the knowledge base between each given concept and a given viewpoint concept, and then computes a similarity degree between the two concepts based on the number of nodes shared by the paths. A small-scale knowledge base was constructed and an experiment was conducted on a SNAP simulator that demonstrated the feasibility of this algorithm. Because of SNAP's scalability, the algorithm is expected to work similarly on a large-scale knowledge base.
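A minimal sketch of the path-overlap idea, assuming a toy concept graph and a simple shared-node score (the SNAP marker-propagation mechanics and the paper's exact similarity degree are not reproduced here):

# Sketch of viewpoint-based similarity over a toy concept graph.
from collections import deque

def shortest_path(graph, start, goal):
    """BFS shortest path in an undirected concept graph given as an adjacency dict."""
    prev = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nxt in graph.get(node, ()):
            if nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    return []

def viewpoint_similarity(graph, concept_a, concept_b, viewpoint):
    """Similarity grows with the number of nodes shared by the two
    concept-to-viewpoint paths (a simple proxy for the paper's degree)."""
    path_a = set(shortest_path(graph, concept_a, viewpoint))
    path_b = set(shortest_path(graph, concept_b, viewpoint))
    if not path_a or not path_b:
        return 0.0
    return len(path_a & path_b) / max(len(path_a), len(path_b))

if __name__ == "__main__":
    g = {"dog": ["canine"], "canine": ["dog", "mammal"], "cat": ["feline"],
         "feline": ["cat", "mammal"], "mammal": ["canine", "feline", "animal"],
         "animal": ["mammal"]}
    print(viewpoint_similarity(g, "dog", "cat", "animal"))  # 0.5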
Hiroyuki KURINO Keiichi HIRANO Taizo ONO Mitsumasa KOYANAGI
We describe a new multiport memory called a Shared DRAM (SHDRAM) that overcomes the bus-bottleneck problem in shared-memory parallel processor systems. The processors are connected directly to the SHDRAM without a conventional common bus. A test chip with 32 kbit of memory cells was fabricated in a 1.5 µm CMOS technology. The basic operation is confirmed by circuit simulation and experimental results. In addition, computer simulation confirms that the performance of a system using the SHDRAM is superior to that of a system using a conventional common bus.
Akihiko HASHIGUCHI Masuyoshi KUROKAWA Ken'ichiro NAKAMURA Hiroshi OKUDA Koji AOYAMA Mitsuharu OHKI Katsunori SENO Ichiro KUMATA Masatoshi AIKAWA Hirokazu HANAKI Takao YAMAZAKI Mitsuo SONEDA Seiichiro IWASE
A programmable DSP with a linear array architecture for real-time video processing is reported. It achieves a processing rate of 5.4 GOPS and a memory bandwidth of 81 GB/s using a Dual Sense Amplifier architecture. A low-power-supply pipeline decreases power consumption and a time-shared bit-line reduces chip area. It has 4320 processor elements and a 1.1 Mbit 3-port memory. The DSP can be applied to HDTV signals with its 75 MHz peak I/O rate. Sufficient programmability is provided to execute video format conversion such as image size conversion and Y/C separation, and picture quality improvement such as noise reduction and image enhancement. The chip was fabricated in a 0.4 µm CMOS triple-metal technology with a 15.12 mm × 14.95 mm die. It operates at 50 MHz and consumes 0.53 W/GOPS at 3.3 V.
Kazutoshi KOBAYASHI Noritsugu NAKAMURA Kazuhiko TERADA Hidetoshi ONODERA Keikichi TAMARU
We have developed and fabricated an LSI called the FMPP-VQ64. The LSI is a memory-based shared-bus SIMD parallel processor containing 64 PEs, intended for low bit-rate image compression using vector quantization. It accelerates the nearest neighbor search (NNS) during vector quantization, and its computation time does not depend on the number of code vectors. The FMPP-VQ64 performs 53,000 NNSs per second with a power dissipation of 20 mW. It can be applied to mobile telecommunication systems.
Jun TAKEDA Ken-ichi TANAKA Kazuo KYUMA
An image recognition system using NEURO4, a programmable parallel processor, is described. Optical flow is the velocity field that an observer detects on a two-dimensional image; it gives useful information, such as edges, about moving objects. The processing time for detecting optical flow on the NEURO4 system was analyzed. Owing to the parallel computation scheme, the processing time on the NEURO4 system is proportional to the square root of the image size, whereas conventional sequential computers need time proportional to the image size. This analysis was verified by experiments on the NEURO4 system. For an 84 × 84 image, the NEURO4 system can detect optical flow in less than 10 seconds; in this case it is 23 times faster than a workstation, a Sparc Station 20 (SS20). As the image size grows, the speed advantage of the NEURO4 system over conventional sequential computers such as the SS20 increases. Furthermore, the parallelization effect increases in proportion to the number of NEURO4 chips connected by a ring expansion scheme. The NEURO4 system is therefore useful for developing moving-image recognition algorithms that require a large amount of processing time.
Dingchao LI Yuji IWAHORI Naohiro ISHII
Parallelism on heterogeneous machines brings cost effectiveness, but also raises a new set of complex and challenging problems. This paper addresses the problem of estimating the minimum time taken to execute a program on a fine-grained parallel machine composed of different types of processors. In an earlier publication, we took the first step in this direction by presenting a graph-construction method which partitions a given program into several homogeneous parts and incorporates timing constraints due to heterogeneous parallelism into each part. In this paper, to make the method easier to apply in a scheduling framework and to demonstrate its practical utility, we present an efficient implementation method and compare the results of its use with the optimal schedule lengths obtained by enumerating all possible solutions. Experimental results for several different machine models indicate that this method can be effectively used to estimate a program's minimum execution time.
Mitsuru MARUYAMA Naohisa TAKAHASHI Takeshi MIEI Tsuyoshi OGURA Tetsuo KAWANO Satoru YAGI
A parallel IP router that uses off-the-shelf workstations and interconnecting switches is presented. This router, called CORErouter-I, is a medium-grained, functionally distributed parallel system consisting of four kinds of processors for routing, routing-table searching, servicing, and line interfacing. Also discussed are issues related to the implementation of CORErouter-I, especially in terms of routing protocol processing and packet-forwarding. Performance characteristics of CORErouter-I are also clarified through several experiments performed to evaluate maximum throughput, analyze packet-forwarding time, and estimate the effect of parallel processing on the route-flapping problem.
Shinhaeng LEE Shin'ichiro OMACHI Hirotomo ASO
Linear programming techniques are useful in many diverse applications such as production planning and energy distribution. Finding an optimal solution of a linear programming problem requires repeated computations and takes a lot of processing time, so special-purpose hardware for high-speed linear programming has been sought. This paper proposes a systolic array for solving linear programming problems using the revised simplex method, a typical algorithm of linear programming. This paper also proposes a modified systolic array that can solve very large linear programming problems.
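For reference, the per-iteration quantities of the textbook revised simplex method are summarized below (for a maximization problem in standard equality form; the paper's systolic mapping of these steps is not shown here):

% Revised simplex iteration for  max c^T x  s.t.  Ax = b, x >= 0,
% with basis matrix B and nonbasic columns N (textbook form).
\begin{align*}
  \text{basic solution:}      &\quad x_B = B^{-1} b,\\
  \text{simplex multipliers:} &\quad y^{T} = c_B^{T} B^{-1},\\
  \text{reduced costs:}       &\quad \bar{c}_N^{T} = c_N^{T} - y^{T} N,\\
  \text{entering column:}     &\quad \text{choose } q \text{ with } \bar{c}_q > 0,\quad d = B^{-1} A_q,\\
  \text{ratio test:}          &\quad \theta = \min_{i:\, d_i > 0} \frac{(x_B)_i}{d_i},
\end{align*}
and the leaving variable attaining $\theta$ is exchanged with $x_q$ to update the basis $B$.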
Hitoshi YAMAUCHI Takayuki MAEDA Hiroaki KOBAYASHI Tadao NAKAMURA
The multipass rendering method based on the global illumination model can generate the most photo-realistic images. However, since the multipass rendering method is very time-consuming, it is impractical in the industrial world. This paper discusses a massively parallel processing approach to fast image synthesis by the multipass rendering method. In particular, we focus on the performance evaluation of view-dependent object-space parallel processing on the (Mπ)2, which was proposed in our previous paper. We also propose two kinds of distributed frame buffer system, named the cached frame buffer and the multistage-interconnected frame buffer, which solve the access-conflict problem on the frame buffer. The simulation results show that the (Mπ)2 has scalable performance; for example, the (Mπ)2 with more than 4000 processing elements can achieve an efficiency of over 50%. We also show that both of the proposed distributed frame buffer systems can relieve the overhead due to frame buffer access in the (Mπ)2 when a large number of high-performance processing elements are adopted in the system.
Kazutoshi KOBAYASHI Masayoshi KINOSHITA Hidetoshi ONODERA Keikichi TAMARU
We propose a memory-based processor called a Functional Memory Type Parallel Processor for vector quantization (FMPP-VQ). The FMPP-VQ is intended for low bit-rate image compression using vector quantization, and it accelerates the nearest neighbor search in vector quantization. In the nearest neighbor search, the code vector nearest to an input vector is sought among a large number of code vectors. The FMPP-VQ has as many PEs (processing elements, also called "blocks") as code vectors, so the distances between an input vector and all code vectors are computed simultaneously, one in each PE. The minimum of all the distances is then found in parallel, as in conventional CAMs, and the computation time does not depend on the number of code vectors. In this paper we describe the architecture of the FMPP-VQ in detail, together with its performance and layout density. We designed and fabricated an LSI including four PEs; the test results and performance estimation of the LSI are also reported.
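The following serial sketch reproduces the result of the search that the FMPP-VQ performs in parallel; the distance measure used here (sum of absolute differences) is an assumption and may differ from the chip's actual metric:

# Functional sketch of the nearest neighbor search parallelized by the FMPP-VQ
# (serial emulation; on the chip every PE holds one code vector, all distances
# are formed at once, and the global minimum is found in parallel as in a CAM).

def nearest_code_vector(input_vec, code_vectors):
    """Return the index of the code vector closest to input_vec."""
    best_index, best_dist = -1, float("inf")
    for i, cv in enumerate(code_vectors):
        # Assumed distance: sum of absolute differences between components.
        dist = sum(abs(a - b) for a, b in zip(input_vec, cv))
        if dist < best_dist:
            best_index, best_dist = i, dist
    return best_index

if __name__ == "__main__":
    codebook = [[0, 0, 0, 0], [8, 8, 8, 8], [16, 16, 16, 16]]
    print(nearest_code_vector([7, 9, 8, 6], codebook))  # -> 1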
Takeshi OGURA Mamoru NAKANISHI
This paper describes content addressable memory (CAM)-based hardware that serves as a highly parallel, compact, real-time image-processing system. The novel concept of a highly-parallel integrated circuit and system (HiPIC), in which a large-capacity CAM tuned for parallel data processing is a key element, is introduced. Several hardware algorithms for highly parallel image processing based on a HiPIC with a CAM are presented to demonstrate that the HiPIC concept is effective for compact, real-time image processing. Two kinds of HiPIC-dedicated CAM have been developed. One is embedded on a 0.5-µm CMOS gate array, on which an embedded CAM of up to 64 kbit and logic of up to 40 kgate can be integrated on a single chip. The other is a 0.5-µm CMOS full-custom CAM LSI tuned for parallel data processing; a fully-parallel 336-kbit CAM LSI has been successfully developed. The HiPIC concept and the CAM-based hardware described here promise to be an important step towards the realization of a compact, real-time image-processing system.
Takafumi AOKI Shinichi SHIONOYA Tatsuo HIGUCHI
This paper explores the potential of multiwave interconnections (optical interconnections that employ wavelength components as multiplexable information carriers) for constructing next-generation multiprocessor systems using MCM technology. A hypercube-based multiprocessor network called the multiwave hypercube (MWHC) is proposed, in which multiwave interconnections provide highly flexible dynamic communication channels among processing elements. A performance analysis shows that the use of multiwavelength optics makes it possible to reduce network complexity on an MCM substrate while supporting low-latency message routing.
Shinji KOMORI Yutaka ARIMA Yoshikazu KONDO Hirono TSUBOTA Ken-ichi TANAKA Kazuo KYUMA
We have developed an SIMD-type neural-network processor (NEURO4) and its software environment. With the SIMD architecture, the chip executes 24 operations in a clock cycle and achieves 1.2 GFLOPS peak performance. An accelerator board containing four NEURO4 chips achieves 3.2 GFLOPS. In this paper we describe the features of the neural network chip, the accelerator board, and the software environment, and present a performance evaluation for several neural network models (LVQ, BP, and Hopfield). The 3.2 GFLOPS neural network accelerator board demonstrates 1.7 GCPS and 261 MCUPS for Hopfield networks.
Dingchao LI Akira MIZUNO Yuji IWAHORI Naohiro ISHII
This paper describes a new approach to the scheduling problem of assigning the tasks of a parallel program, described as a task graph, onto parallel machines. The approach handles interprocessor communication and heterogeneity by using both the theoretical results developed so far and a lookahead scheduling strategy. Experimental results on randomly generated task graphs demonstrate the effectiveness of this scheduling heuristic.
Shietung PENG Igor SEDUKHIN Stanislav SEDUKHIN
In this paper the design of systolic array processors for computing the 2-dimensional Discrete Fourier Transform (2-D DFT) is considered. We investigated three different computational schemes for designing systolic array processors using a systematic approach. The systematic approach guarantees finding, from a large solution space, systolic array processors that are optimal in terms of the number of processing elements and I/O channels, processing time, topology, pipeline period, etc. The optimal systolic array processors are scalable, modular, and suitable for VLSI implementation. An application of the designed systolic array processors to the prime-factor DFT is also presented.
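For reference, one common computational scheme, the row-column (separable) decomposition of the 2-D DFT, is shown below; the paper itself compares three schemes:

% N1 x N2 point 2-D DFT in separable (row-column) form, with W_N = e^{-j 2\pi / N}.
\begin{align*}
  X(k_1,k_2) &= \sum_{n_1=0}^{N_1-1}\sum_{n_2=0}^{N_2-1}
                x(n_1,n_2)\, W_{N_1}^{n_1 k_1} W_{N_2}^{n_2 k_2} \\
             &= \sum_{n_1=0}^{N_1-1}
                \Bigl[\sum_{n_2=0}^{N_2-1} x(n_1,n_2)\, W_{N_2}^{n_2 k_2}\Bigr]
                W_{N_1}^{n_1 k_1},
\end{align*}
so the transform decomposes into 1-D DFTs along one dimension followed by the other, a structure that maps naturally onto arrays of processing elements.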
Akimasa YOSHIDA Ken'ichi KOSHIZUKA Wataru OGATA Hironori KASAHARA
This paper proposes a data-localization scheduling scheme inside a processor cluster for multigrain parallel processing, which hierarchically exploits parallelism among coarse-grain tasks such as loops, medium-grain tasks such as loop iterations, and near-fine-grain tasks such as statements. The proposed scheme assigns near-fine-grain or medium-grain tasks inside coarse-grain tasks onto processors inside a processor cluster so that maximum parallelism can be exploited and inter-processor data transfer is minimized after data localization for coarse-grain tasks across processor clusters. Performance evaluation on the multiprocessor system OSCAR shows that multigrain parallel processing with the proposed data-localization scheduling can reduce the execution time of application programs by 10% compared with multigrain parallel processing without data localization.
Hideyuki ITO Kouichi NAGAMI Tsunemichi SHIOZAWA Kiyoshi OGURI Yukihiro NAKAMURA
We are working on an algorithm to optimize logic circuits so that they can be realized on a super fine-grain parallel processing architecture. As a part of this work, we have developed an inverter reduction algorithm based on modeling logic circuits as dynamical systems. We implemented the algorithm in the PARTHENON system, a high-level synthesis system developed in NTT's laboratories, and evaluated it using the ISCAS85 benchmarks. We also compare the results with those of both the existing algorithm of PARTHENON and the algorithm of Jain and Bryant.
Yoshihiko UEMATSU Koichi MURATA Shinji MATSUOKA
This paper proposes a parallel word alignment procedure for the m Binary with 1 Complement Insertion (mB1C) and Differential m Binary with 1 Mark Insertion (DmB1M) line codes. In the proposed procedure for the mB1C line code, the word alignment circuit searches (m+1) bit pairs in parallel for complementary relationships. A signal flow graph model for the parallel word alignment procedure is also proposed, and its performance attributes are numerically analyzed. The attributes are compared with those of the conventional bit-by-bit procedure, and it is shown that the proposed procedure displays superior performance in terms of false-alignment probability and maximum average aligning time. The proposed procedure is suitable for high-speed optical data links, because it can be easily implemented using a parallel signal processor operating at a clock rate equal to 1/(m+1) times the mB1C line rate.
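A minimal software sketch of the alignment idea follows, assuming that each (m+1)-bit word ends with the complement of its last data bit (the real circuit checks all m+1 candidate phases simultaneously, and the exact mB1C framing may differ):

# Sketch of word alignment for an mB1C-like stream: find the bit phase at
# which the complementary bit-pair relationship holds for every observed word.
import random

def find_alignment(bits, m, words_to_check=16):
    """Return the phase 0..m whose bit pair is complementary in every word
    of the observation window, or None if no phase qualifies."""
    period = m + 1
    for phase in range(period):
        ok = True
        for w in range(words_to_check):
            i = phase + w * period
            if i + 1 >= len(bits):
                break
            # Complementary pair: a data bit followed by its inverted copy.
            if bits[i + 1] != 1 - bits[i]:
                ok = False
                break
        if ok:
            return phase
    return None

if __name__ == "__main__":
    m = 4
    stream = []
    for _ in range(20):
        word = [random.randint(0, 1) for _ in range(m)]
        stream.extend(word + [1 - word[-1]])  # append the complement bit
    print(find_alignment(stream, m))  # expected phase: m - 1 = 3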
Shoji KAWAHITO Makoto YOSHIDA Yoshiaki TADOKORO Akira MATSUZAWA
This paper presents an analog 2-dimensional discrete cosine transform (2-D DCT) processor for focal-plane image compression. The on-chip analog 2-D DCT processor can directly process the analog signal of a CMOS image sensor. The analog-to-digital conversion (ADC) is performed after the 2-D DCT, which leads to efficient AD conversion of video signals: most of the 2-D DCT coefficients can be digitized by a relatively low-resolution ADC or a zero detector, and the quantization after the 2-D DCT can be carried out by the ADC at the same time. The 8×8-point analog 2-D DCT processor is designed with switched-capacitor (SC) coefficient multipliers and an SC analog memory based on 0.35 µm CMOS technology. The 2-D DCT processor has sufficient precision, high processing speed, low power dissipation, and small silicon area. The resulting smart image sensor chips with data compression and digital transmission functions are useful for high-speed image acquisition devices and portable digital video camera systems.
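For reference, the standard N × N 2-D DCT evaluated block by block (N = 8 here) is:

% Standard N x N 2-D DCT (N = 8), separable into row and column 1-D DCTs.
\[
  X(u,v) = \frac{2}{N}\, C(u)\, C(v)
           \sum_{x=0}^{N-1}\sum_{y=0}^{N-1} f(x,y)\,
           \cos\!\frac{(2x+1)u\pi}{2N}\,
           \cos\!\frac{(2y+1)v\pi}{2N},
  \qquad
  C(k) = \begin{cases} 1/\sqrt{2}, & k = 0,\\ 1, & \text{otherwise}. \end{cases}
\]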